Skip to content

Add a 4-digit SWAR follow-up to loop_parse_if_eight_digits (clang)#382

Open
fcostaoliveira wants to merge 1 commit into
fastfloat:mainfrom
redis-performance:pr/four-digit-followup
Open

Add a 4-digit SWAR follow-up to loop_parse_if_eight_digits (clang)#382
fcostaoliveira wants to merge 1 commit into
fastfloat:mainfrom
redis-performance:pr/four-digit-followup

Conversation

@fcostaoliveira
Copy link
Copy Markdown
Contributor

After the 8-digit SWAR block loop in loop_parse_if_eight_digits, a remaining
run of 4–7 digits is currently finished one byte at a time by the caller. This adds
a single 4-digit SWAR step (reusing the existing read4_to_u32 /
is_made_of_four_digits_fast / parse_four_digits_unrolled helpers, already used
elsewhere in the file) to consume four of those digits at once. The parsed result is
identical — it is purely a faster way to consume the same digits.

Benchmark — m8g.metal-24xl (Graviton4 / Neoverse V2), -O3 -march=native,
simple_fastfloat_benchmark, from_charsdouble, clang 18, base vs patch
measured back-to-back (reproduced across 3 runs):

dataset clang 18
canada.txt +11.7%
mesh.txt +7.7%
random [0,1] ~0%

About the #if defined(__clang__) gate

I gated this to clang, and I want to be upfront about why rather than hide it. On
gcc the extra 4-digit check regresses inputs whose remainder is shorter than 4
digits (e.g. the 17-digit fraction of uniform [0,1]: gcc spends the check but never
takes it, ~−3% on random), while clang shows no such regression and a large gain on
the mid-length fractions in canada/mesh. Gating keeps the clang win with zero
regression on either compiler, but it does introduce a compiler-specific branch.

I'm happy to do whatever you prefer here: drop the gate (accepting the gcc random
regression), keep it gated, rework it into a form that helps both, or drop the PR if
you'd rather not carry a compiler-specific path. Numbers and reasoning are all here so
you can decide.

Correctness

FASTFLOAT_TEST (14/14) + supplemental corpus + the randomized random_string /
short_random_string tests pass under clang; an exhaustive float32 sweep
(FASTFLOAT_EXHAUSTIVE) is running. Clean C++11/C++20 under
-Werror -Wall -Wextra -Weffc++ -Wconversion -Wsign-conversion; clang-format clean.
No multi-byte reads beyond the existing endian-safe read4_to_u32, so big-endian is
unaffected.

After the 8-digit SWAR block loop, consume a remaining 4-7 digit run in one
read4_to_u32 + parse_four_digits_unrolled step instead of byte-by-byte (reusing
the existing 4-digit helpers). The parsed result is identical; this is purely a
faster way to consume the same digits.

Gated to clang: on gcc the extra 4-digit check regresses inputs whose remainder
is < 4 digits (e.g. the 17-digit fraction of uniform [0,1] -> -3% on 'random'),
because the check becomes pure overhead there; clang does not show that.

m8g.metal-24xl (Graviton4), -O3 -march=native, simple_fastfloat_benchmark,
from_chars->double, clang 18, base vs patch back-to-back (2 samples):
  canada.txt +11.7%, mesh.txt +7.4%, random ~flat. No regression.
@fcostaoliveira
Copy link
Copy Markdown
Contributor Author

Exhaustive validation finished: all five float32 sweeps pass under clang against this branch — exhaustive32, exhaustive32_64, exhaustive32_midpoint, long_exhaustive32, long_exhaustive32_64 (5/5, 0 failures).

Copy link
Copy Markdown
Member

@lemire lemire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let us merge once the tests are green.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants